ENDOGENOUS POST-STRATIFICATION IN SURVEYS:
CLASSIFYING WITH A SAMPLE-FITTED MODEL 

By F. Jay Breidt 1 and Jean D. Opsomer 2 

Colorado State University 

Post-stratification is frequently used to improve the precision of 
survey estimators when categorical auxiliary information is available 
from sources outside the survey. In natural resource surveys, such in- 
formation is often obtained from remote sensing data, classified into 
categories and displayed as pixel-based maps. These maps may be 
constructed based on classification models fitted to the sample data. 
Post-stratification of the sample data based on categories derived 
from the sample data ("endogenous post-stratification") violates the 
standard post-stratification assumptions that observations are clas- 
sified without error into post-strata, and post-stratum population 
counts are known. Properties of the endogenous post-stratification 
estimator are derived for the case of a sample-fitted generalized lin- 
ear model, from which the post-strata are constructed by dividing the 
range of the model predictions into predetermined intervals. Design 
consistency of the endogenous post-stratification estimator is estab- 
lished under mild conditions. Under a superpopulation model, consis- 
tency and asymptotic normality of the endogenous post-stratification 
estimator are established, showing that it has the same asymptotic 
variance as the traditional post-stratified estimator with fixed strata. 
Simulation experiments demonstrate that the practical effect of first 
fitting a model to the survey data before post-stratifying is small, 
even for relatively small sample sizes. 

1. Introduction. Post-stratification (PS) provides a convenient and in- 
expensive way to improve the precision of estimators in a survey, and is very 
widely used. In traditional PS, survey observations are classified without error
into two or more categories, called post-strata, where the corresponding
population counts in those categories are known from some source outside 
the survey. In surveys of human populations, post-strata are often demo- 
graphic subgroups, with population counts available from a census. In nat- 
ural resource surveys, post-strata may be landcover or -use classifications, 
with population counts obtained from remotely sensed data. 

An important example of a natural resource survey is the Forest Inven- 
tory and Analysis (FIA) program conducted by the U.S. Forest Service (see, 
e.g., Frayer and Furnival [5] for a description). In FIA, data of interest are 
collected annually during intensive field visits and are used to produce offi- 
cial estimates for a large number of forest attributes. In the Interior West 
region of the United States, FIA estimates are computed as PS estimators, 
with the strata defined by homogeneous landuse and groundcover categories 
(e.g., nonforest, broadleaf forest, etc.). Population totals and sample point 
classifications for those categories are obtained from maps, which are main- 
tained in a geographic information system (GIS). These maps are derived 
from satellite imagery and other ancillary data layers. 

Satellite imagery from the Landsat Enhanced Thematic Mapper Plus 
(ETM+) as well as from the Moderate Resolution Imaging Spectroradiome- 
ter (MODIS) is an important source of remotely sensed data for mapping 
vegetation over large geographic extents. These data consist of collections 
of pixel-based maps of physical measurements, such as reflectance values at 
different wavelengths, which cannot immediately be converted into useable 
classifications. Instead, categories are obtained by first "training" a classifi- 
cation algorithm on existing satellite imagery and other ancillary digital in- 
formation, and then predicting the categories of all pixels in the region using 
that algorithm. Because of the multidimensional and often highly nonlinear 
nature of the relationships among the variables, the classification algorithms 
in use today can be quite complex. Examples of such algorithms for forest 
resources are neural nets and expert systems (Moisen and Frescino [7]). The
end result of the classification is a digital (raster) map showing the geo- 
graphical distribution of the classes over a region of interest. This map is 
often an important "deliverable" for the organization producing it, and is 
used by scientists and land managers for a variety of purposes. 

Because of the large sample size, detailed nature and high quality of the 
FIA data, it is attractive to use FIA data to train classification 
algorithms to produce landcover maps. There are numerous local 
as well as nationwide mapping efforts that use FIA data for this purpose. 
Some examples of national efforts include development of the National 
Landcover Data (http://www.epa.gov/mrlc/nlcd.html), Landfire 
(http://www.landfire.gov/), and FIA's forest type mapping (Ruefenacht, 
Moisen and Blackard [10]). Questions have been raised about the appropriate
use of these maps in FIA's PS estimation process (Scott et al. [12]),
because the post-strata are delimited with error (since they are based on 
a model fit) and depend on the sample observations themselves. This vio- 
lates two fundamental assumptions of traditional PS: the exact post-stratum 
counts for the population are unknown, and the classification of the sam- 
ple observations into the post-strata is imperfect. Therefore, it is not clear 
whether the resulting estimator continues to be consistent and whether the 
traditional variance estimator remains valid. 

We explore the statistical properties of survey estimators that are post- 
stratified based on a model fitted to the sample observations. To emphasize 
the relationship between the survey data and the stratification, we will refer 
to such an estimator as an endogenous post-stratification estimator, or EPSE 
for short. The EPSE is useful in practice whenever population information 
to construct traditional post-strata is not available, but predictions from a 
sample-fitted classification model can be generated for the entire population. 
We restrict our attention to classification schemes based on parametrically 
specified generalized linear models (McCullagh and Nelder [6]). Some of our 
results will be further restricted to the case of equal-probability sampling, 
as is used in much of FIA. 

An alternative to the EPSE approach is to construct a regression estima- 
tor, using the available auxiliary variables as regressors. This can be done 
using linear models as in the generalized regression estimation (GREG) 
approach (Cassel, Sarndal and Wretman [3]), nonlinear models (Wu and 
Sitter [14]) or nonparametric models (Breidt and Opsomer [2] and Breidt, 
Claeskens and Opsomer [1]). Since these models use the auxiliary variables 
directly, instead of relying only on a classification based on these variables, 
a properly constructed regression estimator might be more efficient than 
the EPSE and would have known design properties. However, there are a 
number of reasons why the EPSE could still be preferable in practice. 

First, suitable classification algorithms have already been developed (in- 
volving extensive variable selection, model validation and calibration) and 
maps with well-defined categories are being produced. These maps synthe- 
size information from many layers of geospatial data, so it is operationally 
efficient to use the generated categories in other estimation problems, rather 
than building new regression models. Further, categories in the classification 
can often be readily interpreted (e.g., forest/nonforest), whereas the remote 
sensing variables (e.g., reflectance at a specific wavelength) are not, so that it 
is easier to explain the estimation procedure and the resulting fits to diverse 
end users. 

Second, both maps and survey estimates are typically generated by the 
same organization, so it is clearly desirable to ensure that the survey es- 
timates are calibrated to the map "control" totals. This is automatically 
achieved under the EPSE approach, but not with a regression estimator. 

Third, PS weights based on a modest number of classes may tend to be
more stable than the weights obtained from regression estimation, especially 
in cases where many potentially correlated variables are used in the regres- 
sion model. In particular, PS weights are guaranteed to be nonnegative, 
while regression weights are not. Negative weights are an especially seri- 
ous consideration if survey data are to be used in model fitting, with many 
statistical programs unable to properly operate in the presence of negative 
weights. 

Finally, the EPSE is robust in the sense that it can compete with
the regression estimator when the regression model is correctly specified, 
and can dominate the regression estimator when the regression model is 
misspecified. 

The EPSE is defined in Section 2 and its properties are described in Sec- 
tion 3, first under a general probability sampling design and then under 
a superpopulation model. Section 4 describes simulation experiments per- 
formed to assess the practical consequences of endogenous PS in a design- 
based context, and closes with a brief discussion. Proofs are provided in the 
Appendix. 

2. Notation and definitions. 

2.1. Post-stratification. Consider a finite population $U_N = \{1, \ldots, N\}$.
For each $i \in U_N$, an auxiliary vector $x_i$ is observed. A probability sample
$s$ of size $n$ is drawn from $U_N$ according to a sampling design $p_N(\cdot)$,
where $p_N(s)$ is the probability of drawing the sample $s$. Assume
$\pi_{iN} = \Pr\{i \in s\} = \sum_{s: i \in s} p_N(s) > 0$ for all $i \in U_N$, and define
$\pi_{ijN} = \Pr\{i, j \in s\} = \sum_{s: i, j \in s} p_N(s)$ for all $i, j \in U_N$.
For compactness of notation we will suppress the subscript $N$ and write
$\pi_i$, $\pi_{ij}$ in what follows. Various study variables, generically
denoted $y_i$, are observed for $i \in s$.

We now introduce some nonstandard notation for PS that will be useful
in our later discussion of endogenous PS. Using the $\{x_i\}_{i \in U_N}$ and a known
vector $\lambda$, a scalar index $\{m(\lambda' x_i)\}_{i \in U_N}$ is constructed and used to
partition $U_N$ into $H$ strata according to predetermined stratum boundaries
$-\infty \le \tau_0 < \tau_1 < \cdots < \tau_{H-1} < \tau_H \le \infty$. Choice of these
boundaries is discussed in Section 3.3 below.

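For exponents $\ell = 0, 1, 2$ and stratum indices $h = 1, \ldots, H$, we define

(1)    $A_{Nh\ell}(\lambda) = \frac{1}{N} \sum_{i \in U_N} y_i^{\ell}\, I_{\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}}$

and

(2)    $A^*_{Nh\ell}(\lambda) = \frac{1}{N} \sum_{i \in U_N} y_i^{\ell}\, \frac{I_{\{i \in s\}}}{\pi_i}\, I_{\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}}$,

where $I_{\{C\}} = 1$ if the event $C$ occurs, and zero otherwise. In this notation,
stratum $h$ has population stratum proportion $A_{Nh0}(\lambda)$, design-weighted
sample post-stratum proportion $A^*_{Nh0}(\lambda)$, and design-weighted sample
post-stratum $y$-mean $A^*_{Nh1}(\lambda)/A^*_{Nh0}(\lambda)$. The traditional design-weighted PS
estimator (PSE) for the population mean $\bar y_N = N^{-1} \sum_{i \in U_N} y_i$ is then

(3)    $\hat\mu^*_y(\lambda) = \sum_{h=1}^{H} A_{Nh0}(\lambda)\, \frac{A^*_{Nh1}(\lambda)}{A^*_{Nh0}(\lambda)}
       = \sum_{i \in s} \left\{ \sum_{h=1}^{H} A_{Nh0}(\lambda)\, \frac{N^{-1} \pi_i^{-1} I_{\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}}}{A^*_{Nh0}(\lambda)} \right\} y_i
       = \sum_{i \in s} w^*_{is}(\lambda)\, y_i$,

where the sample-dependent weights $\{w^*_{is}(\lambda)\}_{i \in s}$ do not depend on $\{y_i\}$
and so can be used for any study variable.

For the important special case of equal-probability designs, in which
$\pi_i = nN^{-1}$, we write

(4)    $A_{nh\ell}(\lambda) = \frac{1}{n} \sum_{i \in s} y_i^{\ell}\, I_{\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}}$.

In this case, the equal-probability PSE for the population mean $\bar y_N$ is

(5)    $\hat\mu_y(\lambda) = \sum_{h=1}^{H} A_{Nh0}(\lambda)\, \frac{A_{nh1}(\lambda)}{A_{nh0}(\lambda)} = \sum_{i \in s} w_{is}(\lambda)\, y_i$,

where the weights $\{w_{is}(\lambda)\}_{i \in s}$ are obtained by substituting $nN^{-1}$ for $\pi_i$ in
(3).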
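As a rough illustration of the equal-probability weights in (5), the following sketch assigns each sampled unit the weight (N_h/N)/n_h, where N_h and n_h are the population and sample counts in its post-stratum. The function name `ps_weights`, its argument names and the toy data are hypothetical; the index values are assumed to be already computed for every population unit.

```python
import numpy as np

def ps_weights(index_s, index_U, tau):
    """Equal-probability post-stratification weights (a sketch of (5)).

    index_s, index_U: index values for the sampled units and for all
    population units (hypothetical argument names).
    tau: stratum boundaries tau_0 < tau_1 < ... < tau_H; all index values
    are assumed to fall inside (tau_0, tau_H].
    """
    index_s = np.asarray(index_s, dtype=float)
    index_U = np.asarray(index_U, dtype=float)
    interior = np.asarray(tau, dtype=float)[1:-1]
    H = len(tau) - 1
    # side="left" places v in the stratum whose upper boundary is the
    # first interior boundary >= v, i.e. intervals are open-left, closed-right.
    h_s = np.searchsorted(interior, index_s, side="left")
    h_U = np.searchsorted(interior, index_U, side="left")
    N_h = np.bincount(h_U, minlength=H)  # population counts per post-stratum
    n_h = np.bincount(h_s, minlength=H)  # sample counts per post-stratum
    if np.any((n_h == 0) & (N_h > 0)):
        raise ValueError("a post-stratum has no sampled units; collapse strata")
    # Weight for a unit in stratum h: (N_h / N) / n_h.
    return (N_h[h_s] / index_U.size) / n_h[h_s]

# Toy data (made up for illustration only).
index_U = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99])
sample = np.array([0, 4, 5, 6, 7])          # positions of sampled units
y_s = np.array([1.0, 10.0, 12.0, 14.0, 16.0])
w = ps_weights(index_U[sample], index_U, (-np.inf, 0.5, np.inf))
ps_estimate = float(np.sum(w * y_s))
```

Because the weights depend only on the index and the boundaries, the same vector `w` can be reused for any other study variable observed on the same sample.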

2.2. Classification based on a generalized linear model. The notation
introduced above does not indicate how the function $m(\cdot)$ might be
constructed, nor how values for the parameter vector $\lambda$ should be determined.
One possibility is to suppose that a particular study variable, $z_i$, follows a
generalized linear model

(6)    $E(z_i \mid x_i) = m(\lambda' x_i), \qquad \mathrm{Var}(z_i \mid x_i) = v(x_i)$,

where the expectations are with respect to the model. [For concreteness,
think of a forest/nonforest indicator, with logistic mean function
$m(\lambda' x_i) = \exp(\lambda' x_i)/\{1 + \exp(\lambda' x_i)\}$, and $x_i$ derived from satellite imagery.]
We will refer to $z_i$ as the PS variable.

If $\lambda$ were known, we could use $m(\lambda' x_i)$ as an index to form the PSE in
(3) for any study variable $y_i$ (even though the PS is based on a single PS
variable $z_i$, the resulting weights can be applied to any response variable
$y_i$). If model (6) is true, then $m(\lambda' x_i)$ is a good predictor for $z_i$. Hence,
the estimator (3) applied to the study variable $z_i$ will be more efficient
than the Horvitz-Thompson estimator, $\bar z_\pi = N^{-1} \sum_{i \in s} \pi_i^{-1} z_i$, which ignores
the auxiliary variables $x_i$. For other study variables $y_i$, the efficiency of (3)
relative to $\bar y_\pi = N^{-1} \sum_{i \in s} \pi_i^{-1} y_i$ will depend on the relationship between the
PS variable $z_i$ and the $y_i$.

2.3. Endogenous post-stratification. In endogenous PS, the vector $\lambda$ is
unknown, so that estimator (3) is infeasible. Instead, $\lambda$ is estimated from
the sample $\{(x_i, z_i) : i \in s\}$ by $\hat\lambda$ using, for instance, maximum likelihood
estimation, and for any $i \in U_N$, $z_i$ is predicted by $\hat z_i = m(\hat\lambda' x_i)$.

The endogenous post-stratification estimator (EPSE) for the population
mean $\bar y_N$ is then defined as

(7)    $\hat\mu^*_y(\hat\lambda) = \sum_{h=1}^{H} A_{Nh0}(\hat\lambda)\, \frac{A^*_{Nh1}(\hat\lambda)}{A^*_{Nh0}(\hat\lambda)} = \sum_{i \in s} w^*_{is}(\hat\lambda)\, y_i$.

As with the PSE weights, the EPSE weights $\{w^*_{is}(\hat\lambda)\}_{i \in s}$ as defined in (7) can
be applied to any study variable $y$. In the special case of equal-probability
designs, we write

(8)    $\hat\mu_y(\hat\lambda) = \sum_{h=1}^{H} A_{Nh0}(\hat\lambda)\, \frac{A_{nh1}(\hat\lambda)}{A_{nh0}(\hat\lambda)} = \sum_{i \in s} w_{is}(\hat\lambda)\, y_i$.

Intuitively, it is reasonable to expect that if $\hat\lambda$ is a "good" estimator for
$\lambda$, then the estimator (7) will behave like the estimator (3), at least
asymptotically. We show this equivalence in the sense of design consistency under
mild design assumptions in the next section. Such results do not readily yield
rates of convergence, because $\hat\mu^*_y(\lambda)$ is not a differentiably smooth function
of $\lambda$, so that traditional Taylor series approaches for the analysis of nonlinear
survey estimators (e.g., Sarndal et al. [11], Chapter 5) cannot be applied. We
therefore restrict our attention to the equal-probability case and study the
model-based properties of $\hat\mu_y(\hat\lambda)$, by exploiting the fact that the model
expectations of the quantities in (1) and (4) are smooth functions, even though
the quantities themselves are not. In particular, we establish a central limit
theorem and a consistent variance estimator for $\hat\mu_y(\hat\lambda)$ under an assumed
superpopulation model. Section 4 provides simulation evidence that these
good model properties also carry over into good design properties.
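To make the two-step construction concrete (fit the model on the sample, then post-stratify on the fitted index), here is a minimal sketch of the equal-probability EPSE in (8). It assumes, purely for illustration, a scalar covariate and an identity-link ratio model whose parameter is estimated by the ratio of sample sums; the function name `epse_mean` and the toy data are hypothetical, and a logistic or other generalized linear fit could be substituted for the fitting step.

```python
import numpy as np

def epse_mean(y_s, z_s, x_s, x_U, tau):
    """Equal-probability EPSE sketch with an assumed ratio model for z.

    y_s, z_s, x_s: study variable, PS variable and covariate on the sample.
    x_U: covariate for every population unit (sample units included).
    tau: stratum boundaries tau_0 < ... < tau_H on the fitted index.
    """
    y_s, z_s, x_s, x_U = (np.asarray(a, dtype=float)
                          for a in (y_s, z_s, x_s, x_U))
    # Step 1: fit the classification model on the sample only.
    lam_hat = z_s.sum() / x_s.sum()          # assumed ratio-model estimate
    # Step 2: classify sample and population units with the fitted index.
    interior = np.asarray(tau, dtype=float)[1:-1]
    H = len(tau) - 1
    h_s = np.searchsorted(interior, lam_hat * x_s, side="left")
    h_U = np.searchsorted(interior, lam_hat * x_U, side="left")
    N_h = np.bincount(h_U, minlength=H)
    n_h = np.bincount(h_s, minlength=H)
    if np.any((n_h == 0) & (N_h > 0)):
        raise ValueError("a post-stratum has no sampled units; collapse strata")
    # Step 3: combine post-stratum sample means with estimated proportions.
    occupied = n_h > 0
    ybar_h = np.bincount(h_s, weights=y_s, minlength=H)[occupied] / n_h[occupied]
    estimate = float(np.sum(N_h[occupied] / x_U.size * ybar_h))
    return estimate, lam_hat

# Toy data (made up): the PS variable is exactly twice the covariate.
x_U = np.arange(1, 11) / 10.0
idx = np.array([0, 2, 4, 6, 8])
x_s = x_U[idx]
z_s = 2.0 * x_s
y_s = np.array([1.0, 1.0, 1.0, 2.0, 2.0])
estimate, lam_hat = epse_mean(y_s, z_s, x_s, x_U, (-np.inf, 1.0, np.inf))
```

Note that both the sample classifications and the population proportions change whenever the fitted parameter changes, which is what distinguishes this estimator from a PSE with fixed strata.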

3. Main results. 

3.1. Design assumptions and design consistency. We assume the general
probability sampling design described in Section 2.1 and consider an
asymptotic framework in which $N \to \infty$ while the number of strata, $H$, and their
boundaries, $\{\tau_h\}$, remain fixed. Assume:

D1. The covariates $\{x_i\}$ satisfy $\|x_i\| < M < \infty$. For $\lambda \ne 0$, the empirical
distribution function $G_{N\lambda}(z) = N^{-1} \sum_{i \in U_N} I_{\{x_i'\lambda \le z\}}$ converges uniformly
in $z$ to a limit $G_\lambda(z)$, $\lim_{N \to \infty} \sup_z |G_{N\lambda}(z) - G_\lambda(z)| = 0$, where the limit
is almost sure if the covariates are stochastic.

D2. The link $m(\cdot)$ is a known, strictly monotone function, $\lambda \ne 0$ is an
unknown parameter vector, and $m^{-1}(\tau_h)$ $(h = 1, 2, \ldots, H)$ are
continuity points of $G_\lambda(z)$. Further, $G_\lambda(m^{-1}(\tau_h)) - G_\lambda(m^{-1}(\tau_{h-1})) > 0$ for
$h = 1, 2, \ldots, H$.

D3. There is a sequence of estimators of $\lambda$, $\{\hat\lambda\}$, with the property that
for every $\varepsilon > 0$, there exists $\delta_\varepsilon \in (0, \infty)$ such that
$\Pr\{\|\hat\lambda - \lambda\| > \delta_\varepsilon\} < \varepsilon$ for all $N$, where the probability is with
respect to the sampling design and the covariate model.

D4. For all $N$, $\min_{i \in U_N} \pi_i \ge \pi^*_N > 0$, where $N\pi^*_N \to \infty$, and there exists
$\kappa \ge 0$ such that $N^{1/2+\kappa} (\pi^*_N)^2 \to \infty$ and

       $\max_{i \in U_N} \sum_{j \in U_N : j \ne i} \Delta_{ij}^2 = O(N^{-2\kappa})$

as $N \to \infty$, where $\Delta_{ij} = \pi_{ij} - \pi_i \pi_j$.

D5. The study variables $\{y_i\}_{i \in U_N}$ satisfy $\limsup_{N \to \infty} N^{-1} \sum_{i \in U_N} y_i^2 < \infty$.



Remarks.

1. Note that no stochastic model is assumed for the $\{y_i\}$ in this design-based
setting. Randomness comes from the probability mechanism that selects
$s$, and possibly from the process generating $x_i$.

2. The uniform convergence in D1 is met by independent and identically
distributed sequences (Glivenko-Cantelli lemma), stationary ergodic
sequences (Tucker (year?)), certain deterministic sequences [e.g.,
polynomials of the form $x_i'\lambda = \sum \ldots$], and so forth.

3. D2 ensures that the post-strata are nonempty and can be unambiguously
determined from the inverse link; D3 asserts that the parameters in the
generalized linear model can be estimated consistently.

4. The first part of D4 allows for sparse sampling in the sense that
$\min_{i \in U_N} \pi_i \to 0$ is allowed as $N \to \infty$. The second part of D4 allows
for nontrivial dependencies in the sampling. Sparser sampling is possible
under weaker design dependence. For example, under simple random
sampling without replacement,
$\max_{i \in U_N} \sum_{j \in U_N : j \ne i} \Delta_{ij}^2 = (N-1)^{-1} (n/N)^2 (1 - n/N)^2 = O(N^{-1})$,
so that D4 holds with $\kappa = 1/2$. On the other hand,
consider single-stage cluster sampling of $m$ equally sized clusters from $M$
clusters via simple random sampling without replacement. All elements
in each selected cluster are observed. Let $c$ denote the cluster size, and
assume it is fixed as $cm = n \to \infty$ and $cM = N \to \infty$. Then

       $\max_{i \in U_N} \sum_{j \in U_N : j \ne i} \Delta_{ij}^2
        = (c-1) \left(\frac{m}{M}\right)^2 \left(1 - \frac{m}{M}\right)^2
        + c(M-1) \left\{ \frac{m(m-1)}{M(M-1)} - \left(\frac{m}{M}\right)^2 \right\}^2 = O(1)$,

so that D4 holds with $\kappa = 0$. Note that the corresponding design-covariance
assumptions A5 in Robinson and Sarndal [9] and A6 in Breidt and
Opsomer [2] are not met in general for this design. The Horvitz-Thompson
estimator is mean square consistent under D4 and D5, a result of
independent interest that is established as a lemma in the Appendix.
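The simple random sampling calculation in remark 4 can be checked with exact rational arithmetic. The sketch below assumes the design covariance of two sample-membership indicators is the joint inclusion probability minus the product of the first-order inclusion probabilities, and compares the summed squared covariances with the closed form quoted in the remark; the function names are hypothetical.

```python
from fractions import Fraction

def srs_sum_sq_cov(n, N):
    """Sum over j != i of squared design covariances under SRS without replacement."""
    pi_i = Fraction(n, N)                          # first-order inclusion probability
    pi_ij = Fraction(n * (n - 1), N * (N - 1))     # joint inclusion probability
    delta = pi_ij - pi_i * pi_i                    # same value for every pair i != j
    return (N - 1) * delta ** 2

def closed_form(n, N):
    """The expression (N - 1)^{-1} (n/N)^2 (1 - n/N)^2 from remark 4."""
    f = Fraction(n, N)
    return f ** 2 * (1 - f) ** 2 / (N - 1)

# With a fixed sampling fraction, N times the sum stays bounded,
# which is the O(1/N) rate used in the remark.
scaled = [float(N * srs_sum_sq_cov(N // 10, N)) for N in (100, 1000, 10000)]
```

Exact fractions are used here so that the identity is verified without floating-point tolerance.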

Result 1. Assume D1-D5. Then the unequal-probability EPSE in (7)
is design consistent in the sense that for all $\varepsilon > 0$,

       $\Pr\{|\hat\mu^*_y(\hat\lambda) - \bar y_N| > \varepsilon\} \to 0$ as $N \to \infty$.

The proof is deferred to the Appendix.

3.2. Superpopulation model assumptions. To study further the properties
of EPSE, we restrict attention to equal-probability designs and introduce
a superpopulation model, which specifies the joint distribution of the
random vector $(x_i, y_i)$, while the randomness of $s$ is not explicitly
considered. In what follows, all expectation, probability, and order in probability
statements are with respect to this superpopulation model. We continue to
consider an asymptotic framework in which $n, N \to \infty$ while the number of
strata, $H$, and their boundaries, $\{\tau_h\}$, remain fixed. Our proofs rely on the
approach of Randles [8]. Formally, we assume the following:

• M1. The covariates $\{x_i\}$ are independent and identically distributed (i.i.d.)
random $p$-vectors with nondegenerate continuous joint probability density
function $f$ and compact support.

• M2. The link $m(\cdot)$ is a known, strictly monotone function on its domain,
$\lambda \ne 0$ is an unknown parameter vector and $v(\cdot)$ in (6) is a bounded,
positive function.

• M3. There is a sequence of estimators of $\lambda$, $\{\hat\lambda\}$, such that
$\hat\lambda - \lambda = O_p(n^{-1/2})$.

• M4. The study variables $y_i \mid x_i$ are conditionally independent random
variables with $E(y_i^4 \mid x_i) < K_1 < \infty$. Also,

(9)    $\alpha_{h\ell}(\lambda) = E\left(y_i^{\ell}\, I_{\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}}\right)$

is continuous in $\lambda$ for $\ell = 0, 1, 2$, and $\alpha_{h0}(\lambda) > 0$ for $h = 1, \ldots, H$. In
particular, the variables $z_i \mid x_i$ are conditionally independent random variables
with

       $E(z_i \mid x_i) = m(\lambda' x_i), \quad \mathrm{Var}(z_i \mid x_i) = v(x_i), \quad E(z_i^4 \mid x_i) < K_1 < \infty$.

• M5. The sample $s$ is selected according to an equal-probability design of
fixed size $n$, with $\pi_i = nN^{-1} \to \pi \in [0, 1]$ as $n, N \to \infty$.

While the conditional independence in M4 rules out certain clustered
designs, it seems quite plausible in large-scale natural resource surveys, where
it is often the case that sampling locations are widely dispersed and, after
correcting for covariates, no spatial dependence remains. (We investigate
the effect of residual spatial dependence via simulation in Section 4.) The
equal-probability design assumed in M5 is also somewhat limiting, but it
does cover the systematic designs used by the U.S. Forest Service in FIA.
Further, our results extend trivially to the case of a fixed number of design
strata (determined prior to sampling, unlike post-strata) with a large
equal-probability sample within each stratum, and possibly unequal probabilities
across strata.

3.3. Central limit theorem. The proof of consistency and asymptotic
normality for the EPSE in (8) with respect to the superpopulation model is
deferred to the Appendix.

Result 2. Under assumptions M1-M5,

       $\left\{ \frac{1}{n} \left(1 - \frac{n}{N}\right) \right\}^{-1/2} \left( \hat\mu_y(\hat\lambda) - \bar y_N \right) \stackrel{d}{\to} N(0, V_{y\lambda})$,

where

       $V_{y\lambda} = \sum_{h=1}^{H} \Pr\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}\, \mathrm{Var}(y_i \mid \tau_{h-1} < m(\lambda' x_i) \le \tau_h)$.

Remarks.

1. When $H = 1$, the asymptotic model variance of $\hat\mu_y(\hat\lambda) - \bar y_N$ from Result
2 is

(10)   $\frac{1}{n} \left(1 - \frac{n}{N}\right) \mathrm{Var}(y_i)$.

This is the model variance of $\bar y_\pi - \bar y_N$ under any equal-probability design,
or the model-averaged design variance of the ordinary sample mean under
simple random sampling without replacement.

2. For general $H$, the asymptotic model variance in Result 2 is equal to that
of the traditional post-stratified estimator (i.e., in which $\lambda$ is known). This
variance has an intuitive form: it sums stratum fraction times
within-stratum variance of $y$ over the post-strata. Note that

       $\sum_{h=1}^{H} \alpha_{h0}(\lambda) \left( \frac{\alpha_{h1}(\lambda)}{\alpha_{h0}(\lambda)} \right)^2 \ge \left( \sum_{h=1}^{H} \alpha_{h0}(\lambda)\, \frac{\alpha_{h1}(\lambda)}{\alpha_{h0}(\lambda)} \right)^2 = (E y_i)^2$,

with equality only if the post-stratum means are all identical:
$\alpha_{h1}(\lambda)/\alpha_{h0}(\lambda) = E(y_i)$ for $h = 1, \ldots, H$. Thus, we have that

       $\mathrm{Var}(y_i) = \sum_{h=1}^{H} \alpha_{h0}(\lambda)\, \frac{\alpha_{h2}(\lambda)}{\alpha_{h0}(\lambda)} - (E y_i)^2$
       $\ge \sum_{h=1}^{H} \alpha_{h0}(\lambda) \left\{ \frac{\alpha_{h2}(\lambda)}{\alpha_{h0}(\lambda)} - \left( \frac{\alpha_{h1}(\lambda)}{\alpha_{h0}(\lambda)} \right)^2 \right\}$
       $= \sum_{h=1}^{H} \Pr\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}\, \mathrm{Var}(y_i \mid \tau_{h-1} < m(\lambda' x_i) \le \tau_h)$,

so that unless the post-stratum means of $y$ are all identical, the EPSE will
be asymptotically more efficient than the Horvitz-Thompson estimator
(the ordinary sample mean in this case).

3. For the PS variable $z_i$, it can be shown that

(11)   $V_{z\lambda} = E[v(x_i)] + \sum_{h=1}^{H} \Pr\{\tau_{h-1} < m(\lambda' x_i) \le \tau_h\}\, \mathrm{Var}(m(\lambda' x_i) \mid \tau_{h-1} < m(\lambda' x_i) \le \tau_h)$

with $v(x_i) = \mathrm{Var}(z_i \mid x_i)$. A lower bound for the asymptotic model variance
of $\hat\mu_z(\hat\lambda) - \bar z_N$ is given by

       $\frac{1}{n} \left(1 - \frac{n}{N}\right) E[v(x_i)]$,

which is also the asymptotic model variance of the error of the nonlinear
regression estimator

(12)   $\hat\eta_z(\hat\lambda) = N^{-1} \sum_{i \in U_N} m(\hat\lambda' x_i) + n^{-1} \sum_{i \in s} \left( z_i - m(\hat\lambda' x_i) \right)$.

Hence, the quantity

       $\frac{1}{n} \left(1 - \frac{n}{N}\right) \left( V_{z\lambda} - E[v(x_i)] \right)$

measures the asymptotic loss in efficiency of the EPSE relative to the
regression estimator $\hat\eta_z(\hat\lambda)$. The EPSE will be as asymptotically efficient as
the regression estimator (12), that is, $V_{z\lambda} = E[v(x_i)]$, if the $m(\lambda' x_i)$ are
constant within each stratum. If this is not the case, the EPSE will fail to
match the asymptotic efficiency of the regression estimator. It should be
noted that although the EPSE for the study variable $z$ is therefore likely
to be dominated by the regression estimator (12), this is not necessarily
the case if a different regression estimator is used. For other study
variables $y$, the EPSE may be better than a regression estimator, depending
on the relationship between $y$ and $x$ in the population. We explore this
further in the simulation study in the next section.

4. In some applications, it might be of interest to select the $\{\tau_h\}$ defining the
categories to improve the efficiency of the EPSE for a "target" variable $z$.
As noted in Section 1, these class boundaries are often determined by the
requirements of the classification algorithm and the desired map output to
which the EPSE is calibrated, so that little choice might be available when
they are applied in the construction of the post-strata, except for possibly
collapsing neighboring post-strata in case of small sample sizes. If the
operational environment allows for the selection of stratum boundaries,
then boundaries might be constructed by applying the cumulative
root-density method described in Cochran [4], Section 5A.7, to the $m(\hat\lambda' x_i)$,
though this method requires further study.
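The variance expression (11) can be evaluated in closed form in simple cases. The sketch below assumes, for illustration, the ratio-model setting used later in the simulations: a scalar covariate uniform on (0, 1), index equal to twice the covariate, and conditional variance equal to twice the noise variance times the covariate, so that its average is the noise variance. The function name `v_z_ratio_model` is hypothetical.

```python
import numpy as np

def v_z_ratio_model(sigma, tau):
    """Evaluate (11) under the assumed uniform-covariate ratio model.

    The index is 2x with x ~ Uniform(0, 1), so boundaries tau on the index
    correspond to boundaries tau / 2 on x, clipped to the unit interval.
    """
    edges = np.clip(np.asarray(tau, dtype=float) / 2.0, 0.0, 1.0)
    widths = np.diff(edges)               # stratum probabilities for uniform x
    within = 4.0 * widths ** 2 / 12.0     # Var(2x | stratum) for a uniform piece
    return sigma ** 2 + float(np.sum(widths * within))

sigma = 0.25
v_one = v_z_ratio_model(sigma, (-np.inf, np.inf))                    # H = 1
v_two = v_z_ratio_model(sigma, (-np.inf, 1.0, np.inf))               # H = 2
v_four = v_z_ratio_model(sigma, (-np.inf, 0.5, 1.0, 1.5, np.inf))    # H = 4

# Asymptotic efficiency ratios relative to the four-stratum estimator.
ratio_ht = v_one / v_four          # unstratified sample mean versus EPSE
ratio_reg = sigma ** 2 / v_four    # regression lower bound versus EPSE
```

With one stratum the expression reduces to the unconditional variance of the PS variable, so `ratio_ht` is the asymptotic gain over the ordinary sample mean, while `ratio_reg` below one reflects the loss relative to the regression estimator (12) described in remark 3.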

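3.4. Variance estimation. We now consider variance estimation for the
EPSE. The standard post-stratified design variance estimator under simple
random sampling without replacement is

(13)   $\hat V(\hat\mu_y(\lambda)) = \frac{1}{n} \left(1 - \frac{n}{N}\right) \sum_{h=1}^{H} \left( \frac{N_h}{N} \right)^2 \frac{n}{n_h}\, s^2_{yh}$
       $= \frac{1}{n} \left(1 - \frac{n}{N}\right) \sum_{h=1}^{H} \frac{A^2_{Nh0}(\lambda)}{A_{nh0}(\lambda)}\, \frac{A_{nh2}(\lambda) - A^2_{nh1}(\lambda)/A_{nh0}(\lambda)}{A_{nh0}(\lambda) - n^{-1}}$,

where $N_h$ and $n_h$ are population and sample counts within post-stratum $h$,
and $s^2_{yh}$ is the sample $y$-variance within post-stratum $h$ (see, e.g., Sarndal
et al. [11], equation (7.6.5)).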

The next result shows that the analogous estimator under endogenous 
PS consistently estimates the asymptotic model variance of the EPSE. The 
proof is again deferred to the Appendix. 
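A minimal sketch of this variance estimator, written in terms of post-stratum counts and within-stratum sample variances, is given below. The function name `ps_variance_estimate` and the toy inputs are hypothetical; the stratum labels are assumed to have been produced already, either from fixed strata or from a sample-fitted index.

```python
import numpy as np

def ps_variance_estimate(y_s, h_s, N_h):
    """Post-stratified variance estimate under simple random sampling.

    y_s: study variable on the sample.
    h_s: zero-based post-stratum label of each sampled unit.
    N_h: population count in each post-stratum.
    """
    y_s = np.asarray(y_s, dtype=float)
    h_s = np.asarray(h_s)
    N_h = np.asarray(N_h, dtype=float)
    n, N = y_s.size, N_h.sum()
    total = 0.0
    for h in range(N_h.size):
        if N_h[h] == 0:
            continue                      # empty population stratum contributes nothing
        y_h = y_s[h_s == h]
        if y_h.size < 2:
            raise ValueError("need at least two sampled units per post-stratum")
        s2_h = y_h.var(ddof=1)            # within-stratum sample variance
        total += (N_h[h] / N) ** 2 * (n / y_h.size) * s2_h
    return (1.0 - n / N) / n * total

# Toy inputs (made up for illustration only).
y_s = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0])
h_s = np.array([0, 0, 0, 1, 1, 1])
v_hat = ps_variance_estimate(y_s, h_s, N_h=[10, 10])
half_width = 1.96 * np.sqrt(v_hat)       # half-width of a nominal 95% interval
```

The guard on strata with fewer than two sampled units reflects the practical need to collapse neighboring post-strata when sample sizes are small.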



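Result 3. Let

(14)   $\hat V(\hat\mu_y(\hat\lambda)) = \frac{1}{n} \left(1 - \frac{n}{N}\right) \sum_{h=1}^{H} \frac{A^2_{Nh0}(\hat\lambda)}{A_{nh0}(\hat\lambda)}\, \frac{A_{nh2}(\hat\lambda) - A^2_{nh1}(\hat\lambda)/A_{nh0}(\hat\lambda)}{A_{nh0}(\hat\lambda) - n^{-1}}$.

Under assumptions M1-M5, as $n, N \to \infty$,

       $\{\hat V(\hat\mu_y(\hat\lambda))\}^{-1/2} \left( \hat\mu_y(\hat\lambda) - \bar y_N \right) \stackrel{d}{\to} N(0, 1)$.
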

4. Simulations. The two main goals of the simulation are to assess the 
design efficiency of the EPSE and the design bias of the variance estimator 
(14); we also look at confidence interval coverage. The simulations are per- 
formed in a setting that mimics a real survey, in which characteristics of mul- 
tiple study variables are estimated using one set of weights. The weights for 
estimation of a mean are the Horvitz-Thompson estimator (HTE) weights 
$\{n^{-1}\}_{i \in s}$, the PSE weights $\{w_{is}(\lambda)\}_{i \in s}$, the EPSE weights $\{w_{is}(\hat\lambda)\}_{i \in s}$, or
the simple linear regression (REG) weights (e.g., (6.5.12) of Sarndal et al. 
[11]). The HTE does not use auxiliary information; the PSE uses auxiliary 
information with a known model; and the EPSE and REG use auxiliary 
information with fitted models. 

We consider two different models for the PS variable, $z_i$. First, we look at
the situation in which the true model in (6) for the PS variable is continuous 
and follows a ratio model (see, e.g., Sarndal et al. [11], page 226), so that 
m(-) is the identity function. Second, we consider the case where the PS 
variable is binary and m(-) is the logistic link. 

4.1. Ratio model post-stratification. We first describe the simulation setup
for the ratio model PS. We assume a population of size $N = 1000$ with eight
survey variables of interest. For the PS variable $z_i$, we let
$E(z_i \mid x) = m(\lambda x) = 1 + 2(x - 0.5)$ and $\mathrm{Var}(z_i \mid x) = v(x) = 2\sigma^2 x$, while for the
remaining seven variables ($y_i$), we take their mean functions to be

       quadratic:     $g_1(x) = 1 + 2(x - 0.5)^2$,
       bump:          $g_2(x) = 1 + 2(x - 0.5) + \exp(-200(x - 0.5)^2)$,
       jump:          $g_3(x) = \{1 + 2(x - 0.5)\} I_{\{x \le 0.65\}} + 0.65\, I_{\{x > 0.65\}}$,
       exponential:   $g_4(x) = \exp(-8x)$,
       cycle 1:       $g_5(x) = 2 + \sin(2\pi x)$,
       cycle 4:       $g_6(x) = 2 + \sin(8\pi x)$,
       white noise:   $g_7(x) = 8$,

and variance equal to $\sigma^2$, with $x$ uniformly distributed on (0,1) and the
errors for all functions independent and normally distributed. The variance
function for the PS variable is chosen so that, averaging over the covariate $x$,
we have $E[v(x)] = \sigma^2$. Thus, the PS variable and the remaining seven study
variables all have the same variance, averaged over $x$. We considered two
different values of $\sigma$: 0.25 and 0.50.
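The population described above can be generated along the following lines. This is only a sketch under the stated setup: the function name `simulate_population`, the dictionary keys and the seed are hypothetical, and the jump location of 0.65 and the treatment of the boundary point are read from the mean function as printed.

```python
import numpy as np

def simulate_population(N=1000, sigma=0.25, seed=0):
    """Simulate one fixed population of the PS variable and seven study variables."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=N)
    linear = 1 + 2 * (x - 0.5)                      # equals 2x: the ratio-model mean
    means = {
        "quadratic": 1 + 2 * (x - 0.5) ** 2,
        "bump": linear + np.exp(-200 * (x - 0.5) ** 2),
        "jump": np.where(x <= 0.65, linear, 0.65),  # assumed jump location
        "exponential": np.exp(-8 * x),
        "cycle_sin_2pi": 2 + np.sin(2 * np.pi * x),
        "cycle_sin_8pi": 2 + np.sin(8 * np.pi * x),
        "white_noise": np.full(N, 8.0),
    }
    pop = {"x": x}
    # PS variable: heteroscedastic noise with Var(z | x) = 2 * sigma^2 * x.
    pop["z"] = linear + sigma * np.sqrt(2 * x) * rng.normal(0.0, 1.0, N)
    # Remaining study variables: homoscedastic normal noise with variance sigma^2.
    for name, g in means.items():
        pop[name] = g + rng.normal(0.0, sigma, N)
    return pop, rng

pop, rng = simulate_population(N=1000, sigma=0.25, seed=0)
# One simple random sample without replacement of size n = 50.
sample = rng.choice(pop["x"].size, size=50, replace=False)
```

Repeating the last line many times against the same `pop` mirrors the design-based setting, in which the population is held fixed and only the sample varies.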

For each noise level, we fixed the population (i.e., simulated N values for 
each of the eight variables of interest) and drew 1000 replicate samples of 
size $n = 50$ and $n = 100$, each via simple random sampling without replacement
from this fixed population. We constructed HTE and REG weights
using standard methods. We used the model for the PS variable for
constructing the PSE weights, using known parameter values, and the EPSE
weights, using fitted parameter values. The weights were then applied to
the remaining seven study variables. Hence, the PS model will only be
correctly specified for the PS variable. The EPSE and PSE were calculated
using two strata with boundaries $\tau = (-\infty, 1.0, \infty)$, using four strata with
boundaries $\tau = (-\infty, 0.5, 1.0, 1.5, \infty)$, and using six strata with boundaries
$\tau = (-\infty, 0.5, 0.75, 1.0, 1.25, 1.5, \infty)$.

Table 1 summarizes the design efficiency results as ratios of the MSE of 
the HTE, PSE(H), or REG over the MSE of the EPSE(H), where H = 2, 4
or 6 strata. Overall, the results show that the EPSE behaves as expected: it 
produces a large improvement in efficiency relative to the HTE for the vari- 
able on which the PS is based, as well as for most of the other variables that 
are correlated with it. When the number of strata increases, the efficiency 
gains become more pronounced, though EPSE begins to break down due to 
post-stratum sample sizes of zero or 1 when the number of strata is large 
and the sample size is small. When the relationship between the variables 
of interest and the auxiliary variable becomes less strong (i.e., higher noise 
levels), the efficiency gains of EPSE decrease. EPSE is typically as good as 
or better than REG for study variables on which the regression model is 
badly misspecified, but loses out to REG when the true model is linear or 
nearly so ("bump"). The "white noise" variable shows that, when a variable 
is not related to the stratification variable, the efficiency is near that of the 
HTE, but with decreasing efficiency as the number of strata increases (since 
the strata are entirely unnecessary). 

Table 1 also shows that the EPSE is essentially equivalent to the PSE in 
terms of design efficiency, even for n = 50, implying that the effect of basing 
the PS on a fitted model instead of on exogenous strata is negligible for 
moderate to large sample sizes. 

Next, we consider the variance estimator proposed in (14) by computing 
percentage relative biases (100% times the variance bias divided by the true 
design variance) for the PSE variance estimator (13) and the EPSE variance 
estimator (14). These results (not tabled) show that neither estimator is 
unbiased, and both tend to show negative bias (147 of the 192 cases in Table 
1). The bias of the EPSE variance estimator tracks that of the PSE variance 
estimator closely for low noise, low number of strata and large sample size, 
but the tracking deteriorates as noise increases, number of strata increases 
or sample size decreases. 

Finally, we assess the quality of the normal approximation by constructing 
approximate 95% confidence intervals from the pivotal quantity in Result 3. 
These confidence intervals, $\hat\mu_y(\hat\lambda) \pm 1.96 \{\hat V(\hat\mu_y(\hat\lambda))\}^{1/2}$, attained empirical



Table 1

Ratio of MSE of Horvitz-Thompson (HTE), post-stratification on H strata [PSE(H)] and linear regression (REG)
estimators to MSE of endogenous post-stratification estimator on H strata [EPSE(H)]

Response                      EPSE(2) versus          EPSE(4) versus          EPSE(6) versus
variable        σ      n     HTE   PSE(2)    REG     HTE   PSE(4)    REG     HTE   PSE(6)    REG

PS variable   0.25    50    2.76    1.06    0.42    4.57    1.02    0.70    4.95    0.99    0.73
              0.25   100    2.46    1.01    0.41    4.36    1.01    0.72    4.75    1.01    0.78
              0.50    50    1.78    1.04    0.74    2.14    1.01    0.87    2.17    1.01    0.86
              0.50   100    1.65    1.04    0.74    2.01    1.02    0.90    2.00    1.01    0.90

quad          0.25    50    0.94    1.00    1.01    1.09    1.01    1.18    1.04    0.98    1.10
              0.25   100    0.93    1.00    1.00    1.10    1.01    1.19    1.08    0.99    1.16
              0.50    50    0.95    1.00    1.00    0.94    1.00    1.00    0.89    0.99    0.94
              0.50   100    0.93    1.00    0.99    0.96    0.99    1.02    0.93    0.98    1.00

bump          0.25    50    2.17    0.98    0.70    3.13    0.97    1.01    4.29    0.99    1.35
              0.25   100    2.16    0.98    0.69    3.21    0.97    1.02    4.20    0.98    1.33
              0.50    50    1.55    0.97    0.83    1.84    0.99    0.98    2.06    1.00    1.10
              0.50   100    1.57    0.98    0.84    1.86    0.99    1.01    2.09    1.00    1.13

jump          0.25    50    1.09    0.99    0.99    1.54    0.99    1.40    1.79    0.97    1.60
              0.25   100    1.10    0.99    1.00    1.55    0.98    1.41    1.83    0.99    1.66
              0.50    50    1.01    1.00    0.99    1.13    1.01    1.11    1.12    0.97    1.10
              0.50   100    1.00    0.99    1.00    1.14    1.00    1.14    1.17    0.98    1.17

expon         0.25    50    1.12    1.01    0.88    1.26    1.01    1.00    1.19    0.99    0.94
              0.25   100    1.08    1.00    0.85    1.31    1.00    1.04    1.29    0.99    1.02
              0.50    50    1.02    1.00    0.96    1.01    1.01    0.95    0.94    0.99    0.90
              0.50   100    0.98    1.01    0.94    1.03    0.99    0.98    1.01    0.98    0.96

cycle 1       0.25    50    3.59    1.02    1.67    3.57    1.04    1.66    4.84    1.00    2.09
              0.25   100    3.77    1.00    1.65    3.73    1.00    1.64    4.82    1.01    2.11
              0.50    50    2.11    0.98    1.28    2.14    1.04    1.30    2.33    0.99    1.35
              0.50   100    2.23    1.01    1.28    2.23    1.02    1.29    2.43    1.00    1.40

cycle 4       0.25    50    1.05    1.00    0.98    1.00    1.00    0.93    1.40    0.95    1.28
              0.25   100    1.07    1.00    0.97    1.05    1.01    0.95    1.59    0.98    1.45
              0.50    50    1.02    1.01    0.98    0.97    1.00    0.93    1.12    0.89    1.06
              0.50   100    1.06    1.01    0.98    1.04    1.02    0.96    1.26    0.89    1.16

white 


0.25 


50 


0.96 


1.00 


1.00 


0.91 


1.00 


0.94 


0.86 


0.98 


0.89 


noise 


0.25 


100 


0.94 


1.00 


0.99 


0.92 


1.00 


0.98 


0.90 


0.99 


0.96 




0.50 


50 


0.96 


1.00 


1.00 


0.90 


1.00 


0.94 


0.85 


0.98 


0.89 




0.50 


100 


0.94 


1.00 


0.99 


0.92 


0.99 


0.97 


0.90 


0.98 


0.95 



Numbers greater than 1 favor EPSE. Based on ratio model post-stratification in 1000 replications of simple random sampling from a 
fixed population of size N — 1000. Replications in which at least one stratum had fewer than two samples are omitted from the summary: 
55 reps for six strata at n — 50, a — 0.25; 58 reps for six strata at n = 50, a — 0.5; and 3 reps for four strata at n = 50, a = 0.5. 
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coverages (not tabled) ranging from 92.1% to 95.8% for the 96 combinations 
of noise level, sample size, number of strata and study variable. These em- 
pirical coverages track closely the empirical coverages of confidence intervals 
constructed from the PSE. 
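
The empirical coverage reported here can be sketched as follows. This is a minimal illustration, assuming each replication yields a point estimate and a variance estimate and that the target is the finite population mean; the function name and the toy inputs are hypothetical.

```python
import math


def empirical_coverage(estimates, variance_estimates, target, z=1.96):
    """Fraction of replicates whose interval est +/- z*sqrt(v) contains target."""
    covered = 0
    for est, var in zip(estimates, variance_estimates):
        half_width = z * math.sqrt(var)
        if est - half_width <= target <= est + half_width:
            covered += 1
    return covered / len(estimates)


# Toy replicate values, made up for illustration (not results from the paper).
toy_cov = empirical_coverage([0.0, 1.0, 5.0], [1.0, 1.0, 1.0], target=0.0)
```

In the toy call, the first two intervals contain the target and the third does not.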

We repeated the experiments with $\sigma = 0.25$, $n = 50$, and $H = 2$, 4 or 6
strata for the case with $x_i = i/(N + 1)$ ($i = 1, 2, \ldots, N$) and with the residuals
for every variable autocorrelated: $\mathrm{corr}(z_i, z_j) = \mathrm{corr}(y_i, y_j) = 0.99^{|i-j|}$. This
setting clearly violates the conditional independence assumption of M4. The 
results (not tabled) indicate that the EPSE remains essentially unbiased and 
its confidence intervals continue to have close to nominal coverage (92.9%- 
95.6%). Efficiency compared to HTE is even greater than in the condition- 
ally independent case, because positive autocorrelation is trend-like behavior 
that is captured to some extent by the post-strata. The variance estimator 
tends to have less negative bias or more positive bias than in the conditional 
independence case. Overall, these limited simulations suggest that EPSE 
maintains its good behavior outside of the limited setting described in the 
technical assumptions of Section 3. 
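
One way to generate residuals with the correlation structure $0.99^{|i-j|}$ used in this experiment is a stationary first-order autoregression. The sketch below is an assumption-laden illustration: it uses Gaussian innovations, and the function name, seed and default arguments are hypothetical rather than taken from the paper.

```python
import math
import random


def autocorrelated_residuals(n_units, sigma, rho=0.99, seed=0):
    """Gaussian residuals with corr(e_i, e_j) = rho**abs(i - j) and variance sigma**2.

    Uses a stationary AR(1) recursion; normality is an assumption of this sketch.
    """
    rng = random.Random(seed)
    residuals = [rng.gauss(0.0, sigma)]
    innovation_sd = sigma * math.sqrt(1.0 - rho * rho)
    for _ in range(1, n_units):
        residuals.append(rho * residuals[-1] + rng.gauss(0.0, innovation_sd))
    return residuals


N = 1000
x = [i / (N + 1) for i in range(1, N + 1)]     # equally spaced covariate x_i = i/(N+1)
e = autocorrelated_residuals(N, sigma=0.25)    # one autocorrelated residual series
```

Scaling the innovations by $\sqrt{1-\rho^2}$ keeps the marginal variance equal to $\sigma^2$ for every unit.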

4.2. Logistic model post-stratification. Since the theory of this paper covers
generalized linear models, the above simulation experiments were repeated
after replacing the ratio model for the PS variable by a logistic model,
but keeping all other models the same. In this case, the PS variable $z_i$ is now
a Bernoulli variable with expectation
$m(x) = \exp(\lambda_0 + \lambda_1 x)/(1 + \exp(\lambda_0 + \lambda_1 x))$.
The values for $(\lambda_0, \lambda_1)$ were chosen as $(-10, 20)$ for the "low noise"
case ($\sigma = 0.25$ for the remaining variables) and as $(-3, 6)$ for the "high noise"
case ($\sigma = 0.50$ for the other variables). These levels will be denoted as the
"steep" and the "flat" model, respectively. Two, four and six equal-size strata
partitioning [0, 1] are considered for the PSE and EPSE. All estimators remain as in
Section 4.1.
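
The logistic setup can be sketched as below. This is an illustration only: the Uniform(0, 1) covariate, the seed and the helper names are assumptions, and the strata shown use the true parameters (as for the PSE), whereas the EPSE would use parameters fitted to the sample.

```python
import math
import random


def logistic_mean(x, lam0, lam1):
    """m(x) = exp(lam0 + lam1*x) / (1 + exp(lam0 + lam1*x))."""
    return 1.0 / (1.0 + math.exp(-(lam0 + lam1 * x)))


def stratum_index(prediction, n_strata):
    """0-based index h such that h/H < prediction <= (h+1)/H (0 maps to stratum 0)."""
    return min(max(math.ceil(prediction * n_strata) - 1, 0), n_strata - 1)


rng = random.Random(1)
N = 1000
x = [rng.random() for _ in range(N)]   # assumed Uniform(0, 1) covariate
z_steep = [int(rng.random() < logistic_mean(xi, -10.0, 20.0)) for xi in x]
z_flat = [int(rng.random() < logistic_mean(xi, -3.0, 6.0)) for xi in x]
strata = [stratum_index(logistic_mean(xi, -10.0, 20.0), 4) for xi in x]
```

Both parameter pairs give $m(0.5) = 0.5$; the "steep" curve moves from near 0 to near 1 much faster around that point.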

Table 2 displays the relative efficiency for the logistic model simulations
using n = 200 (the logistic fits were problematic at smaller sample sizes). The
findings are very similar to those discussed for the ratio model. The EPSE 
continues to improve substantially over the HTE for most variables, while 
not deviating substantially from the PSE with known stratum classifications. 
Further, the EPSE continues to be competitive with the REG estimator. 
Efficiency tends to increase from two to four strata, but level off from four 
to six strata. 
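
The relative efficiencies in Tables 1 and 2 are ratios of Monte Carlo mean squared errors. A minimal sketch of that summary follows, assuming the replicate estimates of each estimator are stored in lists and the target is the finite population mean; the function names and toy numbers are hypothetical.

```python
from statistics import fmean


def mse(estimates, target):
    """Monte Carlo mean squared error of replicate estimates about the target."""
    return fmean([(est - target) ** 2 for est in estimates])


def mse_ratio(competitor_estimates, epse_estimates, target):
    """MSE of a competitor divided by MSE of the EPSE; values above 1 favor the EPSE."""
    return mse(competitor_estimates, target) / mse(epse_estimates, target)


# Toy replicate values, made up for illustration (not results from the paper).
toy_ratio = mse_ratio([0.0, 4.0], [1.0, 3.0], target=2.0)
```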

Approximate 95% confidence intervals computed from the pivotal quan- 
tity in Result 3 attained empirical coverages (not tabled) ranging from 93.7% 
to 96.3% for the 48 combinations of model, number of strata and study 
variable in Table 2. These empirical coverages track closely the empirical 
coverages of confidence intervals constructed from the PSE, and are quite 
close to nominal in spite of finite-sample bias in the variance estimator. 



Table 2

Ratio of MSE of Horvitz-Thompson (HTE), post-stratification on H strata [PSE(H)] and linear regression (REG) estimators to MSE
of endogenous post-stratification estimator on H strata [EPSE(H)]

                               EPSE(2) versus          EPSE(4) versus          EPSE(6) versus
Population    Model     σ     HTE  PSE(2)     REG     HTE  PSE(4)     REG     HTE  PSE(6)     REG
PS variable   steep  0.25    3.62    1.01    1.15    4.33    0.93    1.37    4.59    0.98    1.44
              flat   0.50    1.42    1.02    0.88    1.52    1.00    0.95    1.51    0.98    0.94
quad          steep  0.25    1.07    1.00    1.00    1.13    1.01    1.06    1.14    1.00    1.06
              flat   0.50    1.08    1.00    1.00    1.12    1.01    1.04    1.10    0.99    1.02
bump          steep  0.25    2.32    0.94    0.61    3.46    0.95    0.91    3.85    0.94    1.03
              flat   0.50    1.73    0.96    0.77    2.22    1.00    0.99    2.49    0.99    1.11
jump          steep  0.25    1.11    0.99    0.97    1.21    0.99    1.05    1.32    1.01    1.15
              flat   0.50    1.05    0.99    0.98    1.25    0.97    1.16    1.29    1.04    1.20
expon         steep  0.25    1.25    0.99    0.87    1.27    1.00    0.89    1.29    1.00    0.89
              flat   0.50    1.14    1.00    0.96    1.18    1.00    0.99    1.17    1.00    0.98
cycle 1       steep  0.25    3.98    0.99    1.67    4.80    0.97    2.02    5.14    1.02    2.14
              flat   0.50    2.39    0.97    1.29    2.47    0.98    1.34    2.74    1.01    1.48
cycle 4       steep  0.25    1.04    1.00    0.97    1.12    1.03    1.05    1.19    0.99    1.12
              flat   0.50    1.04    1.00    0.97    1.04    0.99    0.98    1.18    0.92    1.11
white noise   steep  0.25    1.06    1.00    1.00    1.04    1.01    0.98    1.04    1.00    0.97
              flat   0.50    1.06    1.00    1.00    1.05    1.00    0.99    1.02    1.00    0.96

Numbers greater than 1 favor EPSE. Based on logistic model post-stratification in 1000 replications of simple random sampling of size
n = 200 from a fixed population of size N = 1000. Replications in which at least one stratum had fewer than two samples are omitted
from the summary: 42 reps for six strata and 2 reps for four strata on steep curve, σ = 0.25.
Overall, these simulation experiments demonstrate that the practical effect
of first fitting a parametric model with a fixed number of parameters
to the survey data before post-stratifying is small, even for relatively small 
sample sizes. Given the types of models actually being used for classifica- 
tion in forest inventory applications, it would be of interest to study the case 
of semiparametric and nonparametric classification models, either analyti- 
cally or via simulation, to assess the effect of a large number of unknown 
parameters that grows with sample size. 
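
To make the estimator being evaluated concrete, the following is a minimal sketch of an endogenous post-stratification mean under simple random sampling. It assumes that a model has already been fitted to the sample and used to predict every population unit; the function name, argument names and toy data are hypothetical and this is not the authors' simulation code.

```python
from bisect import bisect_left


def epse_srs(pop_predictions, sample_index, sample_y, cutpoints):
    """Post-stratified mean under SRS with strata defined by model predictions.

    pop_predictions: prediction for every population unit from the sample-fitted model
    sample_index:    population positions of the sampled units
    sample_y:        study-variable values for those units
    cutpoints:       increasing interior boundaries tau_1 < ... < tau_{H-1}
    """
    n_strata = len(cutpoints) + 1
    # Stratum h holds predictions with tau_{h-1} < prediction <= tau_h.
    pop_strata = [bisect_left(cutpoints, p) for p in pop_predictions]
    pop_counts = [0] * n_strata
    for h in pop_strata:
        pop_counts[h] += 1
    sums = [0.0] * n_strata
    counts = [0] * n_strata
    for i, y in zip(sample_index, sample_y):
        h = pop_strata[i]
        sums[h] += y
        counts[h] += 1
    total = len(pop_predictions)
    estimate = 0.0
    for h in range(n_strata):
        if pop_counts[h] == 0:
            continue  # an empty population stratum contributes nothing
        if counts[h] == 0:
            raise ValueError("a nonempty post-stratum has no sampled units")
        estimate += (pop_counts[h] / total) * (sums[h] / counts[h])
    return estimate


# Toy data, made up for illustration: two strata split at 0.5.
toy_estimate = epse_srs([0.1, 0.2, 0.6, 0.9], [0, 2, 3], [1.0, 3.0, 5.0], [0.5])
```

The estimate weights each sample stratum mean by the population share of units classified into that stratum, so both the weights and the classification depend on the fitted model.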

APPENDIX 

We begin with design results, in which all probability computations are
with respect to the sampling design and the covariate model, specified in
D1. To establish the design consistency of the EPSE, we begin with two
lemmas.

Lemma A.1. Under D4 and D5, the Horvitz-Thompson estimator

$$\bar y_\pi = \sum_{i\in U_N} \frac{y_i I_{\{i\in s\}}}{N\pi_i}$$

is mean square consistent in the sense that

$$E[(\bar y_\pi - \bar y_N)^2] \to 0 \quad \text{as } N \to \infty$$

and hence design-consistent in the sense that for all $\varepsilon > 0$,

$$\Pr\{|\bar y_\pi - \bar y_N| > \varepsilon\} \to 0$$

as $N \to \infty$.

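Proof. It suffices to show mean square consistency. Because the Horvitz-
Thompson estimator is unbiased,

$$E[(\bar y_\pi - \bar y_N)^2]
= \frac{1}{N^2}\sum_{i,j\in U_N}\Delta_{ij}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}
\le \Bigl(\frac{1}{N}\sum_{i\in U_N} y_i^2\Bigr)
\Biggl\{\frac{1}{N\pi_N^*}
+ \frac{\bigl(N\max_{i\in U_N}\sum_{j\in U_N:\, j\ne i}\Delta_{ij}^2\bigr)^{1/2}}{N\pi_N^{*2}}\Biggr\},$$

which converges to zero as $N \to \infty$ under D4 and D5. □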
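
As a small numerical illustration of the estimator in Lemma A.1, the sketch below computes the Horvitz-Thompson mean from sampled values and their inclusion probabilities. The function name and the toy inputs are hypothetical.

```python
def horvitz_thompson_mean(sample_y, sample_pi, population_size):
    """Sum of y_i / pi_i over the sample, divided by the population size N."""
    return sum(y / pi for y, pi in zip(sample_y, sample_pi)) / population_size


# Under simple random sampling without replacement, pi_i = n / N for every unit,
# so the Horvitz-Thompson mean reduces to the ordinary sample mean.
toy_ht = horvitz_thompson_mean([2.0, 4.0], [0.5, 0.5], population_size=4)
```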
Lemma A.2. Assume that D1-D5 hold and fix $h \in \{1, 2, \ldots, H\}$. Define

$$L_i(u) = \begin{cases} 1, & \text{if } m^{-1}(\tau_{h-1}) < x_i'(\lambda + u) \le m^{-1}(\tau_h); \\ 0, & \text{otherwise,} \end{cases}$$

and $L_i = L_i(0)$. Define $Q_{N1}(u) = N^{-1}\sum_{i\in U_N}(L_i(u) - L_i)$,
$Q_{N2}(u) = N^{-1}\sum_{i\in U_N} I_{\{i\in s\}}\pi_i^{-1}(L_i(u) - L_i)$ and
$Q_{N3}(u) = N^{-1}\sum_{i\in U_N} y_i I_{\{i\in s\}}\pi_i^{-1}(L_i(u) - L_i)$.
Then for $\ell = 1, 2, 3$ and for all $\varepsilon > 0$,
$\Pr\{|Q_{N\ell}(\hat\lambda - \lambda)| > \varepsilon\} \to 0$ as $N \to \infty$.



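Proof. Consider $Q_{N3}(u)$; the arguments are similar for $Q_{N1}(u)$ and
$Q_{N2}(u)$. Let $\varepsilon, \delta > 0$ be given. Then

$$\begin{aligned}
\Pr\{|Q_{N3}(\hat\lambda - \lambda)| > \varepsilon\}
&= \Pr\{|Q_{N3}(\hat\lambda - \lambda)| > \varepsilon, \|\hat\lambda - \lambda\| > \delta\}
 + \Pr\{|Q_{N3}(\hat\lambda - \lambda)| > \varepsilon, \|\hat\lambda - \lambda\| \le \delta\} \\
&\le \Pr\{\|\hat\lambda - \lambda\| > \delta\}
 + \Pr\Bigl\{\sup_{u:\|u\|\le\delta} |Q_{N3}(u)| > \varepsilon\Bigr\}.
\end{aligned} \tag{15}$$

By D3, the first term converges to zero as $N \to \infty$. Consider the second
term:

$$\Pr\Bigl\{\sup_{u:\|u\|\le\delta} |Q_{N3}(u)| > \varepsilon\Bigr\}
\le \frac{E[\sup_{u:\|u\|\le\delta} |Q_{N3}(u)|]}{\varepsilon}. \tag{16}$$

Now, using the fact that $|x_i'u| \le M\delta$ from D1,

$$\begin{aligned}
|L_i(u) - L_i|
&= \bigl| -I_{\{m^{-1}(\tau_{h-1}) < x_i'\lambda \le m^{-1}(\tau_{h-1}) - x_i'u\}}
 + I_{\{m^{-1}(\tau_{h-1}) - x_i'u < x_i'\lambda \le m^{-1}(\tau_{h-1})\}} \\
&\qquad + I_{\{m^{-1}(\tau_h) < x_i'\lambda \le m^{-1}(\tau_h) - x_i'u\}}
 - I_{\{m^{-1}(\tau_h) - x_i'u < x_i'\lambda \le m^{-1}(\tau_h)\}} \bigr| \\
&\le I_{\{m^{-1}(\tau_{h-1}) < x_i'\lambda \le m^{-1}(\tau_{h-1}) + M\delta\}}
 + I_{\{m^{-1}(\tau_{h-1}) - M\delta < x_i'\lambda \le m^{-1}(\tau_{h-1})\}} \\
&\qquad + I_{\{m^{-1}(\tau_h) < x_i'\lambda \le m^{-1}(\tau_h) + M\delta\}}
 + I_{\{m^{-1}(\tau_h) - M\delta < x_i'\lambda \le m^{-1}(\tau_h)\}} \\
&=: I_{1i} + I_{2i} + I_{3i} + I_{4i},
\end{aligned}$$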

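which does not depend on $u$. Hence

$$\begin{aligned}
E\Bigl[\sup_{u:\|u\|\le\delta} |Q_{N3}(u)|\Bigr]
&\le E\Bigl[\sup_{u:\|u\|\le\delta} N^{-1}\sum_{i\in U_N} \frac{|y_i| I_{\{i\in s\}}}{\pi_i} |L_i(u) - L_i|\Bigr] \\
&\le E\Bigl[\sup_{u:\|u\|\le\delta} N^{-1}\sum_{i\in U_N} \frac{|y_i| I_{\{i\in s\}}}{\pi_i} (I_{1i} + I_{2i} + I_{3i} + I_{4i})\Bigr] \\
&= E\Bigl[N^{-1}\sum_{i\in U_N} \frac{|y_i| I_{\{i\in s\}}}{\pi_i} (I_{1i} + I_{2i} + I_{3i} + I_{4i})\Bigr] \\
&= N^{-1}\sum_{i\in U_N} |y_i| (I_{1i} + I_{2i} + I_{3i} + I_{4i}) \\
&\le \Bigl(N^{-1}\sum_{i\in U_N} y_i^2\Bigr)^{1/2}
 \Bigl(4N^{-1}\sum_{j=1}^{4}\sum_{i\in U_N} I_{ji}\Bigr)^{1/2}.
\end{aligned}$$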

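By D5, it suffices to show that the second term in this product converges to
zero as $N \to \infty$. Now

$$\begin{aligned}
N^{-1}\sum_{i\in U_N} I_{1i}
&= G_{N\lambda}(m^{-1}(\tau_{h-1}) + M\delta) - G_{N\lambda}(m^{-1}(\tau_{h-1})) \\
&= G_{N\lambda}(m^{-1}(\tau_{h-1}) + M\delta) - G_\lambda(m^{-1}(\tau_{h-1}) + M\delta) \\
&\qquad + G_\lambda(m^{-1}(\tau_{h-1}) + M\delta) - G_\lambda(m^{-1}(\tau_{h-1})) \\
&\qquad + G_\lambda(m^{-1}(\tau_{h-1})) - G_{N\lambda}(m^{-1}(\tau_{h-1})) \\
&\le 2\sup_z |G_{N\lambda}(z) - G_\lambda(z)|
 + \{G_\lambda(m^{-1}(\tau_{h-1}) + M\delta) - G_\lambda(m^{-1}(\tau_{h-1}))\}.
\end{aligned}$$

The sup term goes to zero as $N \to \infty$ by D1, and the remaining term in
curly braces can be made arbitrarily small because $\delta > 0$ was arbitrary and
$m^{-1}(\tau_{h-1})$ is a continuity point of $G_\lambda$. Similar arguments can be applied for
$I_{2i}$, $I_{3i}$ and $I_{4i}$. It therefore follows that
$\Pr\{\sup_{u:\|u\|\le\delta} |Q_{N3}(u)| > \varepsilon\} \to 0$ as
$N \to \infty$, so the desired result follows from (15). □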

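Proof of Result 1, design consistency of EPSE. Define

$$L_h = \{x_i : m^{-1}(\tau_{h-1}) < x_i'\lambda \le m^{-1}(\tau_h)\}$$

and define $\hat L_h$ similarly, with $\lambda$ replaced by $\hat\lambda$. For fixed $h \in \{1, 2, \ldots, H\}$,
define

$$F_h = G_\lambda(m^{-1}(\tau_h)) - G_\lambda(m^{-1}(\tau_{h-1})), \qquad g(w_1, w_2) = \frac{w_1}{w_2 + F_h},$$

$$\hat W_{1Nh} = N^{-1}\sum_{i\in U_N} I_{\{x_i\in\hat L_h\}}, \qquad
W_{1Nh} = N^{-1}\sum_{i\in U_N} I_{\{x_i\in L_h\}},$$

$$\hat W_{2Nh} = N^{-1}\sum_{i\in U_N} I_{\{x_i\in\hat L_h\}} I_{\{i\in s\}}\pi_i^{-1} - F_h, \qquad
W_{2Nh} = N^{-1}\sum_{i\in U_N} I_{\{x_i\in L_h\}} - F_h,$$

$$\hat W_{3Nh} = N^{-1}\sum_{i\in U_N} y_i I_{\{x_i\in\hat L_h\}} I_{\{i\in s\}}\pi_i^{-1}, \qquad
W_{3Nh} = N^{-1}\sum_{i\in U_N} y_i I_{\{x_i\in L_h\}}.$$

Note that $g(\cdot)$ is continuous for $w_2 \ne -F_h$ and that $\hat W_{1Nh}$ and $W_{1Nh}$ are
bounded by 1. Choose $\delta \in (0, F_h)$, where $F_h > 0$ by D2. Then

$$\Pr\{|W_{2Nh}| > \delta\} = \Pr\{|G_{N\lambda}(m^{-1}(\tau_h)) - G_{N\lambda}(m^{-1}(\tau_{h-1})) - F_h| > \delta\},$$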
which goes to zero as $N \to \infty$ by D1. Combining Lemma A.1 and Lemma
A.2 we have that for all $\delta > 0$, $\Pr\{|\hat W_{kNh} - W_{kNh}| > \delta\} \to 0$ as $N \to \infty$ for
$k = 1, 2, 3$, so

$$\Pr\{|\hat W_{2Nh}| > \delta\} \le \Pr\{|\hat W_{2Nh} - W_{2Nh}| > \delta/2\} + \Pr\{|W_{2Nh}| > \delta/2\} \to 0$$

as $N \to \infty$. Now given any $\varepsilon > 0$,

$$\begin{aligned}
&\Pr\{|g(\hat W_{1Nh}, \hat W_{2Nh})\hat W_{3Nh} - g(W_{1Nh}, W_{2Nh})W_{3Nh}| > \varepsilon\} \\
&\quad\le \Pr\{|\hat W_{2Nh}| > \delta\} + \Pr\{|W_{2Nh}| > \delta\} \\
&\qquad + \Pr\{|g(\hat W_{1Nh}, \hat W_{2Nh})\hat W_{3Nh} - g(W_{1Nh}, W_{2Nh})W_{3Nh}| > \varepsilon,
 |\hat W_{2Nh}| \le \delta, |W_{2Nh}| \le \delta\}.
\end{aligned}$$

From above, the first and second probabilities go to zero as $N \to \infty$, so
consider the third term. Write $\hat g = g(\hat W_{1Nh}, \hat W_{2Nh})$ and $g = g(W_{1Nh}, W_{2Nh})$
and note that

$$\hat g\hat W_{3Nh} - gW_{3Nh} = \hat g(\hat W_{3Nh} - W_{3Nh}) + (\hat g - g)W_{3Nh},$$

so that the third term in the probability statement above is bounded by

$$\Pr\{|\hat g||\hat W_{3Nh} - W_{3Nh}| > \varepsilon/2, |\hat W_{2Nh}| \le \delta\}
+ \Pr\{|\hat g - g||W_{3Nh}| > \varepsilon/2, |\hat W_{2Nh}| \le \delta, |W_{2Nh}| \le \delta\}.$$

Now $g(w_1, w_2)$ is uniformly continuous and bounded between zero and $(F_h - \delta)^{-1}$
on the set where $|w_2| \le \delta < F_h$, so the first term converges to zero
as $N \to \infty$ by Lemma A.2. For the second term, first note that
$|W_{3Nh}| \le N^{-1}\sum_{i\in U_N} |y_i| \le (N^{-1}\sum_{i\in U_N} y_i^2)^{1/2}$, which is finite by D5. Then, by uniform
continuity of $g$ there exists $\delta_\varepsilon > 0$ such that for all $N$,

$$\begin{aligned}
&\Pr\{|g(\hat W_{1Nh}, \hat W_{2Nh}) - g(W_{1Nh}, W_{2Nh})||W_{3Nh}| > \varepsilon/2, |\hat W_{2Nh}| \le \delta, |W_{2Nh}| \le \delta\} \\
&\quad\le \Pr\{\|(\hat W_{1Nh}, \hat W_{2Nh}) - (W_{1Nh}, W_{2Nh})\| > \delta_\varepsilon\},
\end{aligned}$$

which converges to zero as $N \to \infty$ by Lemma A.2. Since the above results
hold for each $h \in \{1, 2, \ldots, H\}$, it follows that



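$$\begin{aligned}
\Pr\Bigl\{\Bigl|\hat\mu_y(\hat\lambda) - N^{-1}\sum_{i\in U_N} y_i\Bigr| > \varepsilon\Bigr\}
&= \Pr\Bigl\{\Bigl|\sum_{h=1}^{H}\Bigl(\frac{\hat W_{1Nh}\hat W_{3Nh}}{\hat W_{2Nh} + F_h}
 - \frac{W_{1Nh}W_{3Nh}}{W_{2Nh} + F_h}\Bigr)\Bigr| > \varepsilon\Bigr\} \\
&\to 0 \quad \text{as } N \to \infty,
\end{aligned} \tag{17}$$

and the result is proved. □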
Proof of Result 2, central limit theorem for EPSE under
superpopulation model. In what follows, the probability mechanism
is the joint distribution for $(x_i, y_i)$ as given in model assumptions M1, M2
and M4. In particular, all expectation, probability and order in probability
statements are with respect to this superpopulation model. Let $K(\lambda)$ be a
neighborhood of $\lambda$ which is bounded away from 0, and consider $\gamma \in K(\lambda)$.
For $\ell \in \{0, 1, 2\}$, both $A_{Nh\ell}(\gamma)$ and $A_{nh\ell}(\gamma)$ are U-statistics with kernel
$y_i^\ell I_{\{\tau_{h-1} < m(\gamma'x_i) \le \tau_h\}}$ and common expectation $\alpha_{h\ell}(\gamma)$ given in (9). We will
apply Theorem 2.8 of Randles [8] to derive asymptotic approximations for
$A_{Nh\ell}(\hat\lambda)$ and $A_{nh\ell}(\hat\lambda)$, which requires checking that Conditions
2.2 and 2.3 on page 465 of Randles [8] hold for $\gamma \in K(\lambda)$.

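Condition 2.2 is immediate by M3. To verify Condition 2.3, we need to
establish that (2.4) and (2.5) on page 465 of Randles [8] hold. Let $D(\gamma, d) \subset
K(\lambda)$ be a sphere of radius $d$ centered at $\gamma$, and let $\theta \in D(\gamma, d)$. Then the
supremum random variable in (2.4) of Randles [8] is

$$\begin{aligned}
&\sup_{\theta\in D(\gamma,d)} |y_i^\ell I_{\{\tau_{h-1} < m(\theta'x_i) \le \tau_h\}} - y_i^\ell I_{\{\tau_{h-1} < m(\gamma'x_i) \le \tau_h\}}| \\
&\quad= |y_i^\ell| \sup_{\theta\in D(\gamma,d)} |I_{\{\tau_{h-1} < m(\theta'x_i) \le \tau_h\}} - I_{\{\tau_{h-1} < m(\gamma'x_i) \le \tau_h\}}|.
\end{aligned} \tag{18}$$

Since $E(|y_i^\ell| \mid x_i) \le 1 + K_1^{1/4} + K_1^{1/2} + K_1 =: K_2$ for $\ell = 0, 1, 2$ and 4 by hypothesis,
it suffices to look at the difference of the indicators in (18). Similarly
to the reasoning in the proof of Lemma A.2,

$$\begin{aligned}
&|I_{\{\tau_{h-1} < m(\theta'x_i) \le \tau_h\}} - I_{\{\tau_{h-1} < m(\gamma'x_i) \le \tau_h\}}| \\
&\quad\le I_{\{\theta'x_i \le m^{-1}(\tau_h) < \gamma'x_i\}} + I_{\{\gamma'x_i \le m^{-1}(\tau_{h-1}) < \theta'x_i\}} \\
&\qquad + I_{\{\gamma'x_i \le m^{-1}(\tau_h) < \theta'x_i\}} + I_{\{\theta'x_i \le m^{-1}(\tau_{h-1}) < \gamma'x_i\}}.
\end{aligned} \tag{19}$$

We now bound this sum by maximizing the indicated events with respect
to $\theta$, subject to the constraint that $\theta$ is in the closure of $D(\gamma, d)$. Let
$\theta_{\inf} = \arg\inf_{\theta\in D(\gamma,d)} \theta'x_i$ and $\theta_{\sup} = \arg\sup_{\theta\in D(\gamma,d)} \theta'x_i$. Then, $\theta_{\inf}$ and
$\theta_{\sup}$ must occur on the boundary of the sphere $D(\gamma, d)$, since any point
on the interior of the sphere has nonzero derivative for the linear function
$\theta'x_i$. Optimizing $\theta'x_i$ subject to $d^2 = (\gamma - \theta)'(\gamma - \theta)$, we have that
$(\theta_{\inf}, \theta_{\sup}) = (\gamma - d\,x_i/\|x_i\|, \gamma + d\,x_i/\|x_i\|)$, so that the sum of the indicators
in (19) is bounded above by

$$\begin{aligned}
&I_{\{\gamma'x_i - d\|x_i\| \le m^{-1}(\tau_h) < \gamma'x_i\}} + I_{\{\gamma'x_i \le m^{-1}(\tau_{h-1}) < \gamma'x_i + d\|x_i\|\}} \\
&\quad + I_{\{\gamma'x_i \le m^{-1}(\tau_h) < \gamma'x_i + d\|x_i\|\}} + I_{\{\gamma'x_i - d\|x_i\| \le m^{-1}(\tau_{h-1}) < \gamma'x_i\}}.
\end{aligned} \tag{20}$$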
Consider the first of these four indicators. Define

$$G_\gamma(t) = \Pr\{\gamma'x_i \le t\}
= \int_{-\infty}^{t}\Bigl\{\int \frac{1}{|\gamma_1|}
f\Bigl(\frac{s - \gamma_2x_2 - \cdots - \gamma_px_p}{\gamma_1}, x_2, \ldots, x_p\Bigr)\,dx_2\cdots dx_p\Bigr\}\,ds$$

(assume without loss of generality that $\gamma_1$, the first element of $\gamma$, is not
zero). Since $\|x_i\| \le M < \infty$ with probability 1,

$$\begin{aligned}
E(I_{\{\gamma'x_i - d\|x_i\| \le m^{-1}(\tau_h) < \gamma'x_i\}})
&\le E(I_{\{\gamma'x_i - dM \le m^{-1}(\tau_h) < \gamma'x_i\}}) \\
&= G_\gamma(m^{-1}(\tau_h) + dM) - G_\gamma(m^{-1}(\tau_h)) \le K_3d
\end{aligned}$$

for some constant $K_3$, using the mean value theorem and the continuity
and compact support of $f$. Arguing in this fashion for the remaining three
indicators in (20) establishes (2.4) of Randles [8].
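Next,

$$\begin{aligned}
&\lim_{d\to 0} E\Bigl[\sup_{\theta\in D(\gamma,d)} |y_i^\ell I_{\{\tau_{h-1} < m(\theta'x_i) \le \tau_h\}} - y_i^\ell I_{\{\tau_{h-1} < m(\gamma'x_i) \le \tau_h\}}|^2\Bigr] \\
&\quad= \lim_{d\to 0} E\Bigl[y_i^{2\ell} \sup_{\theta\in D(\gamma,d)} |I_{\{\tau_{h-1} < m(\theta'x_i) \le \tau_h\}} - I_{\{\tau_{h-1} < m(\gamma'x_i) \le \tau_h\}}|\Bigr] \\
&\quad\le \lim_{d\to 0} K_2 E\bigl[I_{\{\gamma'x_i - d\|x_i\| \le m^{-1}(\tau_h) < \gamma'x_i\}} + I_{\{\gamma'x_i \le m^{-1}(\tau_{h-1}) < \gamma'x_i + d\|x_i\|\}} \\
&\qquad\qquad + I_{\{\gamma'x_i \le m^{-1}(\tau_h) < \gamma'x_i + d\|x_i\|\}} + I_{\{\gamma'x_i - d\|x_i\| \le m^{-1}(\tau_{h-1}) < \gamma'x_i\}}\bigr] \\
&\quad= 0
\end{aligned}$$

by the previous linear bound on the expectation, so that (2.5) of Randles
[8] is satisfied. It follows from Randles' Theorem 2.8 that

$$A_{Nh\ell}(\hat\lambda) = \alpha_{h\ell}(\hat\lambda) + A_{Nh\ell}(\lambda) - \alpha_{h\ell}(\lambda) + o_p(N^{-1/2}), \tag{21}$$

$$A_{nh\ell}(\hat\lambda) = \alpha_{h\ell}(\hat\lambda) + A_{nh\ell}(\lambda) - \alpha_{h\ell}(\lambda) + o_p(n^{-1/2}). \tag{22}$$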

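Define $a_h = A_{Nh0}(\lambda) - A_{nh0}(\lambda)$ and $b_h = A_{Nh1}(\lambda) - A_{nh1}(\lambda)$. Straightforward
calculations show that

$$\mathrm{Cov}(a_h, a_k) = \frac{1}{n}\Bigl(1 - \frac{n}{N}\Bigr)(\alpha_{h0}(\lambda)I_{\{h=k\}} - \alpha_{h0}(\lambda)\alpha_{k0}(\lambda)),$$

$$\mathrm{Cov}(a_h, b_k) = \frac{1}{n}\Bigl(1 - \frac{n}{N}\Bigr)(\alpha_{h1}(\lambda)I_{\{h=k\}} - \alpha_{h0}(\lambda)\alpha_{k1}(\lambda)),$$

from which it follows that $a_h = O_p(n^{-1/2})$ and $b_h = O_p(n^{-1/2})$. Also note
that $\alpha_{h\ell}(\hat\lambda) - \alpha_{h\ell}(\lambda) = o_p(1)$ by M3 and M4, and that

$$A_{Nh\ell}(\lambda) - \alpha_{h\ell}(\lambda) = O_p(N^{-1/2}) \quad\text{and}\quad A_{nh\ell}(\lambda) - \alpha_{h\ell}(\lambda) = O_p(n^{-1/2})$$

by the central limit theorem.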
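
The common factor $n^{-1}(1 - n/N)$ in the covariances above can be checked with a short worked step. The derivation below is a sketch under the stated superpopulation setting, with generic iid variables $\xi_i$ (a symbol introduced here for illustration) standing in for the kernel values.

```latex
% Assumption: \xi_1,\dots,\xi_N are iid with variance \sigma_\xi^2, and the sample s
% (of size n) is selected independently of the \xi_i, as under simple random sampling.
\begin{align*}
\frac{1}{n}\sum_{i\in s}\xi_i - \frac{1}{N}\sum_{i\in U_N}\xi_i
  &= \Bigl(\frac{1}{n}-\frac{1}{N}\Bigr)\sum_{i\in s}\xi_i
     - \frac{1}{N}\sum_{i\in U_N\setminus s}\xi_i, \\
\operatorname{Var}\Bigl(\frac{1}{n}\sum_{i\in s}\xi_i - \frac{1}{N}\sum_{i\in U_N}\xi_i\Bigr)
  &= \Bigl(\frac{1}{n}-\frac{1}{N}\Bigr)^{2} n\,\sigma_\xi^{2}
     + \frac{N-n}{N^{2}}\,\sigma_\xi^{2}
   = \frac{1}{n}\Bigl(1-\frac{n}{N}\Bigr)\sigma_\xi^{2}.
\end{align*}
```

Taking $\xi_i$ to be the indicator that unit $i$ falls in post-stratum $h$ gives $\sigma_\xi^2 = \alpha_{h0}(\lambda)(1 - \alpha_{h0}(\lambda))$, which agrees with $\mathrm{Cov}(a_h, a_h)$.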

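Since $\bar y_N = \sum_{h=1}^{H} A_{Nh1}(\gamma)$ for any $\gamma$, we have that

$$\hat\mu_y(\hat\lambda) - \bar y_N = \sum_{h=1}^{H}\Bigl(\frac{A_{Nh0}(\hat\lambda)A_{nh1}(\hat\lambda) - A_{nh0}(\hat\lambda)A_{Nh1}(\hat\lambda)}{A_{nh0}(\hat\lambda)}\Bigr). \tag{23}$$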

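Substituting (21) and (22) in the numerator of the summand in (23), we
apply the order results above to obtain

$$\begin{aligned}
&A_{Nh0}(\hat\lambda)A_{nh1}(\hat\lambda) - A_{nh0}(\hat\lambda)A_{Nh1}(\hat\lambda) \\
&\quad= (\alpha_{h0}(\hat\lambda) - \alpha_{h0}(\lambda))(A_{nh1}(\lambda) - A_{Nh1}(\lambda)) \\
&\qquad - (\alpha_{h1}(\hat\lambda) - \alpha_{h1}(\lambda))(A_{nh0}(\lambda) - A_{Nh0}(\lambda)) \\
&\qquad + A_{Nh0}(\lambda)A_{nh1}(\lambda) - A_{nh0}(\lambda)A_{Nh1}(\lambda) + o_p(n^{-1/2}) \\
&\quad= A_{Nh0}(\lambda)A_{nh1}(\lambda) - A_{nh0}(\lambda)A_{Nh1}(\lambda) + o_p(n^{-1/2}) \\
&\quad= (A_{Nh0}(\lambda) - \alpha_{h0}(\lambda) + \alpha_{h0}(\lambda))(A_{nh1}(\lambda) - \alpha_{h1}(\lambda) + \alpha_{h1}(\lambda)) \\
&\qquad - (A_{nh0}(\lambda) - \alpha_{h0}(\lambda) + \alpha_{h0}(\lambda))
 (A_{Nh1}(\lambda) - \alpha_{h1}(\lambda) + \alpha_{h1}(\lambda)) + o_p(n^{-1/2}) \\
&\quad= \alpha_{h1}(\lambda)(A_{Nh0}(\lambda) - A_{nh0}(\lambda))
 + \alpha_{h0}(\lambda)(A_{nh1}(\lambda) - A_{Nh1}(\lambda)) + o_p(n^{-1/2}),
\end{aligned} \tag{24}$$

where we have used the facts that $A_{nh\ell}(\lambda)$ and $A_{Nh\ell}(\lambda)$ are $O_p(1)$ by the
weak law of large numbers.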

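From (22), the denominator of the summand in (23) is $A_{nh0}(\hat\lambda) = \alpha_{h0}(\lambda) +
o_p(1)$, and so

$$\frac{1}{A_{nh0}(\hat\lambda)} = \frac{1}{\alpha_{h0}(\lambda)} + o_p(1), \tag{25}$$

since $\alpha_{h0}(\lambda) > 0$ by M4.

Substituting (24) and (25) into (23), we have

$$\hat\mu_y(\hat\lambda) - \bar y_N = \sum_{h=1}^{H}\Bigl\{\frac{\alpha_{h1}(\lambda)}{\alpha_{h0}(\lambda)}(A_{Nh0}(\lambda) - A_{nh0}(\lambda)) - (A_{Nh1}(\lambda) - A_{nh1}(\lambda))\Bigr\} + o_p(n^{-1/2}), \tag{26}$$

so that the asymptotic distribution is the same as that obtained when $\lambda$ is
known.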

It remains to derive this asymptotic distribution. Applying the central
limit theorem to (26), we have that the limiting distribution of the EPSE
error is normal with mean zero. Using earlier covariance computations, and
the fact that $\sum_{h=1}^{H} b_h = \bar y_N - \bar y_\pi$, it follows that the variance of the leading
terms in (26) is approximated by

$$\mathrm{Var}(\hat\mu_y(\hat\lambda) - \bar y_N)
\simeq -\frac{1}{n}\Bigl(1 - \frac{n}{N}\Bigr)\Bigl\{\sum_{h=1}^{H}\frac{\alpha_{h1}^2(\lambda)}{\alpha_{h0}(\lambda)} - \Bigl(\sum_{h=1}^{H}\alpha_{h1}(\lambda)\Bigr)^2\Bigr\}
+ \mathrm{Var}(\bar y_\pi - \bar y_N). \tag{27}$$

Note that, by definition of expectation given an event,

$$\frac{\alpha_{h1}(\lambda)}{\alpha_{h0}(\lambda)} = E(y_i \mid \tau_{h-1} < m(\lambda'x_i) \le \tau_h)$$

and

$$E(y_i^2) = \sum_{h=1}^{H}\alpha_{h0}(\lambda)\bigl\{\mathrm{Var}(y_i \mid \tau_{h-1} < m(\lambda'x_i) \le \tau_h)
+ [E(y_i \mid \tau_{h-1} < m(\lambda'x_i) \le \tau_h)]^2\bigr\},$$

from which the variance given in Result 2 immediately follows. □

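Proof of Result 3, consistent estimation of EPSE variance under
superpopulation model. Note that $A_{Nh\ell}(\lambda) \to_P \alpha_{h\ell}(\lambda)$ and
$A_{nh\ell}(\lambda) \to_P \alpha_{h\ell}(\lambda)$ as $n, N \to \infty$ by the weak law of large numbers, and
$\alpha_{h\ell}(\hat\lambda) \to_P \alpha_{h\ell}(\lambda)$ by continuity of $\alpha_{h\ell}(\cdot)$ for $\ell = 0, 1, 2$. Using (21) and (22) of the Appendix,
the term in curly braces in (14) converges in probability to

$$\sum_{h=1}^{H}\alpha_{h0}(\lambda)\Bigl\{\frac{\alpha_{h2}(\lambda)}{\alpha_{h0}(\lambda)} - \Bigl(\frac{\alpha_{h1}(\lambda)}{\alpha_{h0}(\lambda)}\Bigr)^2\Bigr\},$$

from which the result follows by Slutsky's theorem and Result 2. □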
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